SwiftKey is a text messaging application that predicts the next word as you type a message. When the user types "I went to the ", the application presents three options for what the next word might be, for example: gym, store, restaurant.
In this project, we use R text mining tools to build an application that predicts the next word entered into a Shiny application. We use data scraped from blogs, Twitter, and news sites to build a language model, and we predict the next word by computing the maximum likelihood estimate (MLE) from that model.
In a nutshell, here’s a summary of the data analysis performed in this report.
The Data Science Capstone Course provided text data from three sources: blogs, Twitter, and news. The table below shows the number of lines per data source.
| | filename | line_count |
|---|---|---|
| 1 | en_US.blogs.txt | 899288 |
| 2 | en_US.news.txt | 1010242 |
| 3 | en_US.twitter.txt | 2360148 |
| 4 | Total | 4269678 |
We use a text mining framework within R to transform the unstructured text into the structured document-by-term matrix format required for analysis. The first step is to "clean" the text with a series of text processing functions. We use the following preprocessing steps.
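As a minimal sketch, the cleaning steps might look like the base-R function below. The function name and exact transformations are illustrative assumptions, not the report's actual implementation: lowercase the text, strip punctuation and digits, and collapse whitespace.

```r
# Illustrative text-cleaning pipeline (assumed steps: lowercase,
# strip punctuation and numbers, collapse extra whitespace)
clean_text <- function(lines) {
  lines <- tolower(lines)                   # normalize case
  lines <- gsub("[[:punct:]]", " ", lines)  # drop punctuation
  lines <- gsub("[[:digit:]]", " ", lines)  # drop numbers
  lines <- gsub("\\s+", " ", lines)         # collapse runs of whitespace
  trimws(lines)                             # trim leading/trailing spaces
}

clean_text("I went to the GYM, at 5pm!")
# returns "i went to the gym at pm"
```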
Second, we tokenize the text into ngrams. For our analysis, a gram is a whitespace-delimited sequence of characters corresponding to an English word. The N in ngram is the number of words treated as part of the same unit. For example: a 1-gram is new, a 2-gram is new york, a 3-gram is new york city, and a 4-gram is new york city police.
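A sliding-window tokenizer captures this idea. The sketch below is an assumed base-R implementation (the report's actual tokenizer may differ): split on whitespace, then join each run of n consecutive words into one ngram.

```r
# Illustrative n-gram tokenizer: split text on whitespace, then join
# each run of n consecutive words into a single ngram string
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

make_ngrams("new york city police", 2)
# returns c("new york", "york city", "city police")
```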
Finally, we build a document-by-term matrix by transforming the ngrams into a bag-of-words model. Each column of the matrix corresponds to a unique ngram; each row represents a document, with cell values giving the frequency of each ngram in that document.
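For intuition, here is a toy bag-of-words construction in base R (a sketch with made-up documents, not the framework's actual API): count term frequencies per document and assemble them into a documents-by-terms matrix.

```r
# Toy bag-of-words sketch: rows = documents, columns = unique terms,
# cells = how often each term appears in each document
docs   <- c("new york city", "new york is big")
tokens <- lapply(docs, function(d) strsplit(d, " ")[[1]])
terms  <- sort(unique(unlist(tokens)))
# table(factor(...)) yields a count for every term, including zeros
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = terms))))
rownames(dtm) <- paste0("doc", seq_along(docs))
dtm
```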
The 4,269,678 lines from the complete dataset can be memory intensive for the text mining tools and slow the analysis. To speed things up, we subsample 1% of the complete dataset and then work with the subsampled data for exploration and modeling. The subsampling implementation is in Appendix 1.
| source | num_lines | num_unique_words | mean_word_freq | median_word_freq |
|---|---|---|---|---|
| twitter | 23602 | 8040 | 20 | 9 |
| blogs | 8993 | 12414 | 15 | 6 |
| news | 10103 | 12850 | 15 | 7 |
We see that the mean word frequency is nearly twice the median frequency for all data sources. The word frequency distribution has a long tail, as seen in the plot below: 937 of 22899 words (4.1%) cover 50% of all word instances in the dataset, so the mean word frequency is heavily weighted by a few words that occur very frequently.
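The coverage figure above can be computed from the word frequency table. The sketch below uses toy frequencies (the real analysis runs this over the full 22,899-word table): sort words by frequency and find the first rank at which the cumulative share reaches 50%.

```r
# Illustrative coverage check with toy frequencies: how many of the
# most frequent words cover 50% of all word instances?
word_freq <- c(the = 50, to = 30, a = 20, gym = 5, store = 3, rare = 2)
freq_sorted <- sort(word_freq, decreasing = TRUE)
cum_cover   <- cumsum(freq_sorted) / sum(freq_sorted)  # cumulative share
n_cover     <- which(cum_cover >= 0.5)[1]              # first rank at 50%
n_cover
```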
Now we will look at the top bigrams, trigrams and 4-grams in the dataset.
The top bigrams include places like new york and common phrases like happy birthday and last night.
The top trigrams include places like new york city and common phrases like happy mothers day and rock n roll.
The top 4-grams do not appear to be commonly occurring phrases; instead they reflect frequently occurring lines in the corpus. I will investigate a weighting technique like TF-IDF to account for these repeated lines.
We build a tree from the ngrams and compute the maximum likelihood estimate (MLE) using the Dirichlet-multinomial model. We use the data.tree package, which can build a tree from a data.frame. Now let's perform a search for "data".
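Concretely, the estimate used here is the conditional relative frequency of the next word given the preceding words (written without the Dirichlet prior for clarity):

```latex
P(w \mid w_1 \ldots w_{n-1})
  = \frac{\operatorname{count}(w_1 \ldots w_{n-1}\, w)}
         {\operatorname{count}(w_1 \ldots w_{n-1})},
\qquad
P(\text{entry} \mid \text{data}) = \frac{12}{198} \approx 0.061
```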
Here are the maximum likelihood estimates. They show a 6.1% likelihood that entry will be the next word: "data entry" has a frequency of 12 and "data" has a frequency of 198, so the maximum likelihood estimate is 12/198 ≈ 6.1%.
results <- perform_search(ngram_tree, c("data"))
print(results)
## 12 10
## recommended_words "entry" "streams"
## likelihood "0.0606060606060606" "0.0505050505050505"
## 8 7
## recommended_words "recovery" "dating"
## likelihood "0.0404040404040404" "0.0353535353535354"
## 7
## recommended_words "personalize"
## likelihood "0.0353535353535354"
Then, if we query for "data entry", we traverse the tree through the nodes "data" and then "entry", and we recommend the words "just" and "respond", each with likelihood 0.5.
results <- perform_search(ngram_tree, c("data", "entry"))
print(results)
## 6 6
## recommended_words "just" "respond"
## likelihood "0.5" "0.5"
This code collects a 1% sample using a “coin flip” to decide which lines to choose.
# sample the datasci dir: keep each line with probability sample_len
sample_capstone_data <- function(fn, outfn, sample_len=0.01) {
  print(sprintf("Reading %s", fn))
  lines <- readLines(fn)
  set.seed(123)
  print(sprintf("Read %s Length %s", fn, length(lines)))
  # flip a biased coin per line; keep the lines whose flip comes up 1
  keep <- rbinom(length(lines), 1, sample_len) == 1
  lines_sample <- lines[keep]
  print(sprintf("Writing %s. Length %s", outfn, length(lines_sample)))
  writeLines(lines_sample, con=outfn)
}
sample_capstone_data("./data/final/en_US/en_US.twitter.txt",
"./data/final/en_US/sample/en_US.twitter.txt")
sample_capstone_data("./data/final/en_US/en_US.blogs.txt",
"./data/final/en_US/sample/en_US.blogs.txt")
sample_capstone_data("./data/final/en_US/en_US.news.txt",
"./data/final/en_US/sample/en_US.news.txt")